Why SQL Statements So Often Run to Hundreds of Lines and Several KB
An Analysis of Why SQL Is Hard
sales_amount | Sales performance table |
sales | Salesperson's name, assumed unique |
product | Product sold |
amount | The salesperson's sales amount for that product |
Now we want the list of salespeople whose sales of both air conditioners and TVs rank in the top 10.
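1. Sort by air-conditioner sales and find the top 10;
2. Sort by TV sales and find the top 10;
3. Intersect the results of steps 1 and 2 to get the answer.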
select top 10 sales from sales_amount where product='AC' order by amount desc
select top 10 sales from sales_amount where product='TV' order by amount desc
select * from
( select top 10 sales from sales_amount where product='AC' order by amount desc )
intersect
( select top 10 sales from sales_amount where product='TV' order by amount desc )
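A simple three-step computation has to be written like this in SQL, while everyday computations of a dozen or more steps are common; this is clearly beyond what many people can handle.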
create temporary table x1 as
select top 10 sales from sales_amount where product='AC' order by amount desc
2. Compute the top 10 by TV sales in the same way:
create temporary table x2 as
select top 10 sales from sales_amount where product='TV' order by amount desc
3. Take the intersection. After the tedious earlier steps, this one is simpler:
select * from x1 intersect select * from x2
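Splitting into steps makes the reasoning clearer, but using temporary tables is still cumbersome. In batch computations on structured data, temporary sets holding intermediate results are very common; creating a temporary table for each one is inefficient and makes the code less direct.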
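For comparison, the same three steps can be sketched outside SQL, where each intermediate result is simply a named in-memory set. This is a minimal Python illustration, not the article's method: the sample rows, the helper `top_n`, and the cutoff of 2 (instead of 10, to keep the data small) are all assumptions made for the example.

```python
# Hypothetical sample rows (sales, product, amount); not data from the article.
sales_amount = [
    ("Ann", "AC", 900), ("Bob", "AC", 800), ("Cid", "AC", 700),
    ("Ann", "TV", 500), ("Cid", "TV", 650), ("Bob", "TV", 300),
]


def top_n(rows, product, n):
    """Return the names of the n salespeople with the highest amount for one product."""
    ranked = sorted(
        (row for row in rows if row[1] == product),
        key=lambda row: row[2],
        reverse=True,
    )
    return {row[0] for row in ranked[:n]}


top_ac = top_n(sales_amount, "AC", 2)  # step 1: top sellers of air conditioners
top_tv = top_n(sales_amount, "TV", 2)  # step 2: top sellers of TVs
both = top_ac & top_tv                 # step 3: intersection of the two sets
```

Each step keeps its result under a name, so the final intersection reads exactly like the third step of the plan.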
1. Group the data by product, sort each group, and take the top 10 of each;
select sales
from ( select sales
from ( select sales,
rank() over (partition by product order by amount desc ) ranking
from sales_amount)
where ranking <=10 )
group by sales
having count(*)=(select count(distinct product) from sales_amount)
How many people can write SQL like this?
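The group-then-intersect idea behind that query can be sketched more directly when the groups themselves can be kept as sets. The Python below is only an illustration under stated assumptions: the function name `top_in_every_product` and the sample rows are hypothetical, and it takes the first n members by position rather than reproducing the tie handling of `rank()`.

```python
from collections import defaultdict


def top_in_every_product(rows, n):
    """Return salespeople who are among the top n sellers of every product."""
    # Step 1: group the rows by product.
    groups = defaultdict(list)
    for sales, product, amount in rows:
        groups[product].append((amount, sales))
    # Step 2: sort each group by amount (descending) and keep the top n names.
    top_sets = [
        {sales for _, sales in sorted(members, reverse=True)[:n]}
        for members in groups.values()
    ]
    # Step 3: intersect the per-product sets.
    return set.intersection(*top_sets) if top_sets else set()


# Hypothetical sample rows (sales, product, amount).
rows = [
    ("Ann", "AC", 900), ("Bob", "AC", 800), ("Cid", "AC", 700),
    ("Ann", "TV", 500), ("Cid", "TV", 650), ("Bob", "TV", 300),
    ("Ann", "PC", 100), ("Bob", "PC", 200), ("Cid", "PC", 50),
]
winners = top_in_every_product(rows, 2)
```

The number of products is never written down; the intersection simply runs over however many groups exist.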
select sales
from ( select A.sales sales, A.product product,
(select count(*)+1 from sales_amount
where A.product=product AND A.amount<amount) ranking
from sales_amount A )
where product='AC' AND ranking<=10
select sales
from ( select A.sales sales, A.product product, count(*) ranking
from sales_amount A, sales_amount B
where A.product=B.product AND A.amount<=B.amount
group by A.sales,A.product )
where product='AC' AND ranking<=10
Even a professional programmer may not find such SQL easy to write, and all it does is compute a top 10.
employee | Employee table |
name | Employee's name, assumed unique |
gender | Employee's gender |
select employee.gender,count(*)
from employee,
( ( select top 10 sales from sales_amount where product='AC' order by amount desc )
intersect
( select top 10 sales from sales_amount where product='TV' order by amount desc ) ) A
where A.sales=employee.name
group by employee.gender
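A single extra associated table already makes things this cumbersome, yet in practice information is very often stored across tables, frequently several levels deep. For example, salespeople belong to departments and departments have managers; if we now want to know which managers the "good" salespeople report to, three tables must be joined, and writing the where and group clauses of that computation clearly is no easy job.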
select sales.gender,count(*)
from (…) // … is the earlier SQL that computes the "good" salespeople
group by sales.gender
Clearly, this statement is not only clearer but also more efficient to compute (there is no join).
mov ax,3
mov bx,5
mul bx,7
add ax,bx
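Code like this is far harder to write and to read than 3+5*7 (and decimals would make it even worse). It may not trouble a skilled programmer much, but for most people this style is too obscure; in that sense, FORTRAN really was a great invention.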
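More examples
None of these problems is particularly complex, and all of them come up regularly in everyday data analysis, yet they already strain SQL.
Computation cannot be done step by step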
select count(*) from employee where department='sales'
select count(*) from employee where department='sales' and native_place='Beijing'
select count (*) from employee
where department='sales' and native_place='Beijing' and gender='female'
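The natural approach: select the sales staff and count them, then find the Beijing natives among them and count those, then narrow further to the women and count again. Each query builds on the previous result, which is both simpler to write and more efficient.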
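The progressive narrowing described here can be sketched in a few lines when each intermediate result stays available for the next step. This is a hedged Python illustration; the sample records and the variable names are assumptions, not part of the article.

```python
# Hypothetical employee records; field names follow the article's SQL.
employee = [
    {"name": "Li", "department": "sales", "native_place": "Beijing", "gender": "female"},
    {"name": "Wu", "department": "sales", "native_place": "Beijing", "gender": "male"},
    {"name": "Xu", "department": "sales", "native_place": "Shanghai", "gender": "female"},
    {"name": "Ma", "department": "hr", "native_place": "Beijing", "gender": "female"},
]

# Each step filters the previous result instead of rescanning the whole table.
sales_staff = [e for e in employee if e["department"] == "sales"]
from_beijing = [e for e in sales_staff if e["native_place"] == "Beijing"]
women = [e for e in from_beijing if e["gender"] == "female"]

counts = (len(sales_staff), len(from_beijing), len(women))
```

The three SQL statements above repeat the earlier conditions in full, whereas each filter here names only the one new condition.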
with A as
(select name, department,
row_number() over (partition by department order by 1) seq
from employee where gender='male'),
B as
(select name, department,
row_number() over(partition by department order by 1) seq
from employee where gender='female')
select name, department from A
where department in ( select distinct department from B ) and seq=1
union all
select name, department from B
where department in (select distinct department from A ) and seq=1
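Not being able to compute step by step sometimes does more than make the code cumbersome and slow; it can badly distort the line of thought.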
Unordered sets
select name, birthday
from (select name, birthday, row_number() over (order by birthday) ranking
from employee )
where ranking=(select floor((count(*)+1)/2) from employee)
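The median is a common computation that should only require taking the middle member of the sorted set. But SQL's unordered-set model offers no way to access members directly by position, so a sequence-number column has to be manufactured and then selected with a condition, which forces the use of a subquery.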
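With an ordered collection, picking the middle member is a single positional access. The sketch below is a Python illustration with made-up sample data; it uses the same position as the SQL above, floor((count+1)/2) counted from 1, converted to a 0-based index.

```python
from datetime import date

# Hypothetical (name, birthday) records.
employee = [
    ("Li", date(1985, 3, 2)),
    ("Wu", date(1990, 7, 15)),
    ("Xu", date(1978, 11, 30)),
    ("Ma", date(1982, 1, 9)),
]

# Sort by birthday, then take the member at 1-based position floor((n+1)/2).
ordered = sorted(employee, key=lambda e: e[1])
median_employee = ordered[(len(ordered) + 1) // 2 - 1]
```

No sequence-number column and no subquery are needed, because the sorted list itself carries the positions.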
select max (consecutive_day)
from (select count(*)-1 consecutive_day
from (select sum(rise_mark) over(order by trade_date) days_no_gain
from (select trade_date,
case when
closing_price>lag(closing_price) over(order by trade_date)
then 0 else 1 END rise_mark
from stock_price) )
group by days_no_gain)
Unordered sets also distort the line of thought.
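The natural way to think about this problem is a single pass over the prices in date order, keeping a running count that resets whenever the price fails to rise. The following Python sketch shows that idea; the function name and the sample prices are assumptions for illustration, and the input is assumed to be already sorted by trade date.

```python
def longest_rising_streak(closing_prices):
    """Return the largest number of consecutive days on which the price rose."""
    longest = current = 0
    # Compare each day's close with the previous day's close.
    for previous, today in zip(closing_prices, closing_prices[1:]):
        current = current + 1 if today > previous else 0
        longest = max(longest, current)
    return longest


# Hypothetical closing prices in date order.
prices = [10, 11, 12, 11, 12, 13, 14, 13]
streak = longest_rising_streak(prices)
```

The SQL above has to reach the same answer indirectly, by marking non-rising days, accumulating the marks into group numbers, and counting the rows per group.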
Incomplete set orientation
select * from employee
where to_char (birthday, 'MMDD') in
( select to_char(birthday, 'MMDD') from employee
group by to_char(birthday, 'MMDD')
having count(*)>1 )
select name
from (select name
from (select name,
rank() over(partition by subject order by score DESC) ranking
from score_table)
where ranking<=10)
group by name
having count(*)=(select count(distinct subject) from score_table)
Lack of object references
select A.*
from employee A, department B, employee C
where A.department=B.department and B.manager=C.name and
A.gender='male' and C.gender='female'
select * from employee
where gender='male' and department in
(select department from department
where manager in
(select name from employee where gender='female'))
where gender='male' and department.manager.gender='female'
select name, company first_company
from (select employee.name name, resume.company company,
row_number() over(partition by resume.name
order by resume.start_date) work_seq
from employee, resume where employee.name = resume.name)
where work_seq=1
select name,
(select company from resume
where name=A.name and
start_date=(select min(start_date) from resume
where name=A.name)) first_company
from employee A
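SQL, lacking both an object-reference mechanism and thorough set orientation, cannot treat a sub-table as an attribute (field value) of the main table. A query against the sub-table must either use a multi-table join, which complicates the statement and then requires filtering or grouping to bring the result set back into one-to-one correspondence with the main table's records (joined records correspond one-to-one with the sub-table); or use a subquery, which recomputes the subset of sub-table records related to each main-table record every time, increasing both the overall computation (a with clause cannot be used for such a subquery) and the writing effort.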
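To make the idea of object references concrete, here is a small Python sketch in which a foreign key is held as a direct reference to another record, so the "male employees with a female manager" condition can be written by following attributes. The classes, field names, and sample data are assumptions for illustration only; they are not an API of any tool discussed in the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # eq=False avoids recursive comparisons on cyclic references
class Department:
    name: str
    manager: Optional["Employee"] = None


@dataclass(eq=False)
class Employee:
    name: str
    gender: str
    department: Optional[Department] = None


# Hypothetical data: two departments and five employees.
sales, rnd = Department("sales"), Department("R&D")
amy = Employee("Amy", "female", sales)
bob = Employee("Bob", "male", sales)
tom = Employee("Tom", "male", rnd)
joe = Employee("Joe", "male", rnd)
eve = Employee("Eve", "female", rnd)
sales.manager, rnd.manager = amy, tom
employees = [amy, bob, tom, joe, eve]

# The condition reads as attribute access; no join or nested subquery is written.
result = [
    e.name
    for e in employees
    if e.gender == "male" and e.department.manager.gender == "female"
]
```

The filter mirrors the form `department.manager.gender='female'`, which is what a query could look like if a foreign-key field referred directly to the record it points at.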
Introducing SPL
A | B | |
1 | =employee.select(department=="sales") | =A1.len() |
2 | =A1.select(native_place=="Beijing") | =A2.len() |
3 | =A2.select(gender=="female") | =A3.len() |
A | B | C | |
1 | for employee.group(department) | =A1.group@1(gender) | |
2 | >if B1.len()>1 | =@|B1 |
A | |
1 | =employee.sort(birthday) |
2 | =A1((A1.len()+1)/2) |
For SPL, which is built on ordered sets, fetching a value by position is a simple task.
Task 4
A | |
1 | =stock_price.sort(trade_date) |
2 | =0 |
3 | =A1.max(A2=if(closing_price>closing_price[-1],A2+1,0)) |
A | |
1 | =employee.group(month(birthday),day(birthday)) |
2 | =A1.select(~.len()>1).conj() |
SPL can keep the grouped result set, and further processing works just as it does on an ordinary set.
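The same two-step pattern, group first and then keep working on the groups as ordinary collections, can be sketched in Python to show what "keeping the grouped result" buys. This is an illustrative sketch with made-up records, not a rendering of how SPL works internally.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (name, birthday) records.
employee = [
    ("Li", date(1985, 3, 2)),
    ("Wu", date(1990, 3, 2)),
    ("Xu", date(1978, 11, 30)),
    ("Ma", date(1982, 11, 30)),
    ("Hu", date(1991, 6, 1)),
]

# Step 1: group by (month, day); each group is kept as a list of names.
groups = defaultdict(list)
for name, birthday in employee:
    groups[(birthday.month, birthday.day)].append(name)

# Step 2: keep groups with more than one member and concatenate them.
shared_birthday = [
    name for members in groups.values() if len(members) > 1 for name in members
]
```

Because the groups survive as collections, the filter on group size and the final concatenation need no second pass over the original table.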
Task 6
A | |
1 | =score_table.group(subject) |
2 | =A1.(~.rank(score).pselect@a(~<=10)) |
3 | =A1.(~(A2(#)).(name)).isect() |
With SPL, you simply write the computation the way you think it through.
Task 7
A | |
1 | =employee.select(gender=="male" && department.manager.gender=="female") |
A | |
1 | =employee.new(name,resume.minp(start_date).company:first_company) |
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// Prepare the call; "xxxx" is the placeholder name used in the original
CallableStatement st = conn.prepareCall("{call xxxx(?,?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
// execute() returns a boolean, so the result set is fetched separately
st.execute();
ResultSet result = st.getResultSet();
...